Abstract: The importance of power-law distributions is
attributed to the fact that most naturally occurring
phenomena exhibit this distribution. While exponential
distributions can be derived by minimizing KL-divergence
with respect to some moment constraints, some power-law
distributions can be derived by minimizing some generalizations
of KL-divergence (more specifically, some special cases of
Csiszár $f$-divergences). Divergence minimization is very well
studied in information-theoretic approaches to statistics. In
this work we study properties of minimization of Tsallis
divergence, which is a special case of Csiszár $f$-divergence.
In line with the work by Shore and Johnson (IEEE Trans. IT,
1981), we examine the properties exhibited by these
minimization methods, including the Pythagorean property.

I. INTRODUCTION 

The Shannon measure of information, also called entropy,
is central to information theory, which has a wide range
of applications spanning communication theory, statistical
mechanics, probability theory, statistical inference, etc. [1].
It quantifies the uncertainty or information that is
associated with a discrete random variable by taking an
average of the uncertainty (Hartley information) associated
with each state. The first generalization of this measure of
information was suggested by Rényi [2]. He replaced the
linear averaging by Kolmogorov-Nagumo (K-N) averages and
imposed an additivity constraint. Havrda and Charvát
introduced one more generalization, which is now known as
nonextensive entropy or Tsallis entropy [4], [5], [6], and
which has been studied in statistical mechanics.
Another important notion is that of the distance
or divergence between two probability distributions. The
information measure capturing this is KL-divergence,
which is the directed distance between two probability
distributions. KL-divergence is a special case of Tsallis
divergence, which in turn is a special case of Csiszár
$f$-divergence [7]. KL-divergence plays a central role in
Kullback's minimum divergence principle, which is a
means of estimating the probability distribution of a
system. As the estimation technique, it prescribes the
minimization of KL-divergence from a given prior
distribution, subject to moment constraints. Kullback's
minimum divergence principle reduces to Jaynes' maximum
entropy principle when we use the uniform distribution as
the prior. Kullback's minimum divergence principle can be
extended to generalized divergences. When applied to
classical KL-divergence, it yields a distribution from the
exponential family, whereas applying Kullback's principle to
Tsallis divergence gives a power-law distribution.
Exponential distributions are a very important class of
distributions, and many problems have been successfully
modeled using them [8]. Though exponential distributions
are used in many modeling problems [9] due to their
theoretical tractability, many naturally occurring phenomena
exhibit power-law distributions. It is of great practical and
theoretical interest to study both these families of
distributions.
In this work we establish several properties of Tsallis
divergence, namely transformation invariance and subset
independence. In addition, we establish some properties of
Tsallis divergence minimization under classical constraints,
viz. uniqueness, reflexiveness, idempotence, invariance,
weak subset independence and subset aggregation. We also
attempt to derive a Pythagorean property, and we propose a
$q \leftrightarrow 2-q$ additive transformation for Tsallis
divergence.
The paper is organized as follows. In Section II we
introduce the preliminaries and basics required for
understanding the results. Sections III through V are
dedicated to the results and observations made. In these
sections we perform Tsallis divergence minimization for
classical constraints and follow it up with an analysis of
the properties it exhibits. In particular, we study the Shore
and Johnson properties. In the subsequent section we discuss
the $q \leftrightarrow 2-q$ transformation relation which we
established.

II. Preliminaries and Background 

A. Exponential family and KL Divergence 

In many problems we have a prior estimate of the
probability distribution, and given such a prior we are
interested in finding the probability distribution that is
closest to this prior and also satisfies a set of linear
constraints. To define the notion of closeness we need a
distance measure between two distributions. One such
distance measure is KL divergence [10], defined as

$$ I(p\|r) = \sum_{x \in X} p(x) \ln \frac{p(x)}{r(x)} , $$

where $r$ is the prior. The minimization of KL-divergence
results in a posterior which is from the exponential family.
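
As a small numerical illustration of the definition above, the following sketch evaluates $I(p\|r)$ for finite distributions. The function name `kl_divergence` and the two example distributions are illustrative assumptions, not part of the paper; the convention $0 \ln 0 = 0$ is assumed for states with zero posterior mass.

```python
import math

def kl_divergence(p, r):
    """I(p||r) = sum_x p(x) ln(p(x)/r(x)); terms with p(x) = 0 are skipped."""
    return sum(px * math.log(px / rx) for px, rx in zip(p, r) if px > 0.0)

# Hypothetical example: a uniform prior and a non-uniform posterior.
prior = [0.25, 0.25, 0.25, 0.25]
posterior = [0.4, 0.3, 0.2, 0.1]

forward = kl_divergence(posterior, prior)   # I(posterior || prior)
backward = kl_divergence(prior, posterior)  # I(prior || posterior)
print(forward, backward, kl_divergence(prior, prior))
```

The two directions give different values, which is why the text speaks of a directed distance.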

B. Power-Law distribution and Generalized Divergence 

$f$-divergence is a generalized measure of divergence
that was introduced by Csiszár [7] and independently by
Ali & Silvey [11]. Let $f(t)$ be a real valued convex
function defined for $t > 0$, with $f(1) = 0$. The
$f$-divergence of a distribution $p$ from $r$ is defined by

$$ D_f(p\|r) = \sum_{x \in X} r(x) f\left( \frac{p(x)}{r(x)} \right) . $$

Here we take $0 f(0/0) = 0$ and $f(0) = \lim_{t \to 0} f(t)$.
$f$-divergence has many important properties like
non-negativity, monotonicity and convexity. It has been
used in many applications like speech recognition [12],
analysis of contingency tables [7], etc. By specializing $f$
to various functions we get different divergences like
KL-divergence, $\chi^2$-divergence, Hellinger distance,
variational distance, Tsallis divergence, etc. On setting
$f(t) = -t \ln_q (1/t)$ we get Tsallis divergence [4], defined as

$$ I_q(p\|r) = -\sum_{x \in X} p(x) \ln_q \frac{r(x)}{p(x)} , $$

where $\ln_q$ is the $q$-logarithm function [13], defined as
$\ln_q x = \frac{x^{1-q} - 1}{1-q}$ $(x > 0, q \in \mathbb{R})$.
Tsallis divergence recovers KL-divergence for $q \to 1$, i.e.,
$\lim_{q \to 1} I_q(p\|r) = I(p\|r)$. For values of $q > 0$ we
have $I_q(p\|r) \geq 0$, and Tsallis divergence becomes a
convex function of both the parameters. Tsallis divergence
also exhibits a pseudo-additivity property, i.e.,

$$ I_q(X_1 \times X_2 \| Y_1 \times Y_2) = I_q(X_1\|Y_1) \oplus_{2-q} I_q(X_2\|Y_2) , $$

where $X_1$ and $X_2$ are independent, and so are $Y_1$ and
$Y_2$. Here $\oplus_q$ is addition in $q$-deformed algebra [13],
defined as $x \oplus_q y = x + y + (1-q)xy$, so that under the
sign convention used for $I_q$ above the cross term is
$(q-1) I_q(X_1\|Y_1) I_q(X_2\|Y_2)$.
In the minimization of Tsallis divergence the choice of
constraints plays an important role [14].
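
The definitions in this subsection can be checked numerically. The sketch below is only an illustration under stated assumptions: the helper names `ln_q`, `tsallis_divergence` and `f_divergence`, and the example distributions, are hypothetical. It evaluates $I_q(p\|r) = -\sum_x p(x) \ln_q (r(x)/p(x))$, compares it with the $f$-divergence obtained from $f(t) = -t \ln_q(1/t)$, and compares a value of $q$ close to 1 with KL-divergence.

```python
import math

def ln_q(x, q):
    """q-logarithm (x**(1-q) - 1)/(1-q); the ordinary logarithm at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_divergence(p, r, q):
    """I_q(p||r) = -sum_x p(x) ln_q(r(x)/p(x)); zero-mass states are skipped."""
    return -sum(px * ln_q(rx / px, q) for px, rx in zip(p, r) if px > 0.0)

def f_divergence(p, r, f):
    """D_f(p||r) = sum_x r(x) f(p(x)/r(x)) for strictly positive p and r."""
    return sum(rx * f(px / rx) for px, rx in zip(p, r))

# Hypothetical example distributions on three states.
p = [0.2, 0.5, 0.3]
r = [0.3, 0.3, 0.4]
q = 0.6

direct = tsallis_divergence(p, r, q)
via_f = f_divergence(p, r, lambda t: -t * ln_q(1.0 / t, q))
kl = sum(px * math.log(px / rx) for px, rx in zip(p, r))
near_one = tsallis_divergence(p, r, 1.0 + 1e-6)
print(direct, via_f, kl, near_one)
```

The two routes to $I_q$ agree up to rounding, and the value at $q = 1 + 10^{-6}$ is close to the KL value, in line with the limit stated above.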
Tsallis divergence minimization with respect to
$q$-expectation constraints has been studied in [15]. In this
case the Pythagoras theorem is established in [16], [17], [18]
and proved in a differential geometric setup by Ohara [19].
Tsallis divergence minimization with normalized
constraints gives a probability distribution which is
self-referential in nature, i.e., $p(x)$ depends on $p(x)$
itself. Here too a nonextensive Pythagoras property [16], [17]
is exhibited by Tsallis divergence.

In this paper we study this minimization with
respect to classical expectations, as it has the important
property of convexity, ensuring a unique solution.



III. Basic Shore and Johnson Properties 

Shore and Johnson [20], in their work in 1981, discussed
many of the important properties of KL-divergence
minimization. We have found that many of those properties
hold in the case of Tsallis divergence. In this section
we discuss the properties that pertain to Tsallis
divergence itself, i.e., regardless of minimization.
In this section and Section V we shall use the
following notation.

Let $p$ be a pmf of a random variable taking values in
$X$. We would like to impose the following linear equality
and inequality constraints on it:

$$ \sum_{x \in X} p(x) = 1 , \tag{1} $$

$$ \sum_{x \in X} p(x) u_m(x) = \langle u_m \rangle , \quad m = 1 \ldots M , \tag{2} $$

$$ \sum_{x \in X} p(x) w_n(x) \geq \langle w_n \rangle , \quad n = 1 \ldots N . \tag{3} $$

Equations (1), (2) and (3) constitute the constraint set. This
can also be considered as the information available about
the probability distribution. We shall denote a constraint
set by $C$, with a subscript to distinguish between different
constraint sets.

Hence the task of divergence minimization can be viewed
as follows: given a prior probability distribution $r(x)$ and
a constraint set $C$, find the probability distribution
$p_{min}$ such that
$p_{min} = \arg\min_{p \in C} I_q(p\|r)$. It can be easily
verified that the constraint set $C$ constitutes a convex set.
Some of this notation has been borrowed from [20].

Invariance of KL-divergence to coordinate transformations
enables us to generalize KL-divergence to continuous
random variables. We have observed that the invariance
property holds true in the case of Tsallis divergence too.

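Proposition 1 (Invariance): Let $\Gamma$ be a coordinate
transformation from $x \in X$ to $y \in X'$ with
$(\Gamma p)(y) = J^{-1} p(x)$, where $J$ is the Jacobian
$J = \partial(y)/\partial(x)$. Let $\Gamma X$ be the set of
densities $\Gamma p$ corresponding to densities $p \in X$. Let
$(\Gamma C) \subseteq (\Gamma X)$ correspond to $C \subseteq X$.
Then, given a prior distribution $r$,

$$ \arg\min_{p \in \Gamma C} I_q(p\|\Gamma r) = \Gamma \left( \arg\min_{s \in C} I_q(s\|r) \right) , \tag{4} $$

$$ \text{and} \quad I_q((\Gamma p)_{min}\|\Gamma r) = I_q(p_{min}\|r) , \tag{5} $$

hold, where $(\Gamma p)_{min} = \arg\min_{p \in \Gamma C} I_q(p\|\Gamma r)$
and $p_{min} = \arg\min_{s \in C} I_q(s\|r)$.

Proof: We have $(\Gamma p)(y) = J^{-1} p(x)$ and $dy = J\,dx$,
where $J$ is the Jacobian $J = \partial(y)/\partial(x)$. Hence,
for any density $p$,

$$ I_q(\Gamma p\|\Gamma r) = -\int (\Gamma p)(y) \ln_q \frac{(\Gamma r)(y)}{(\Gamma p)(y)} \, dy $$

$$ = -\int J^{-1} p(x) \ln_q \frac{J^{-1} r(x)}{J^{-1} p(x)} \, J \, dx = -\int p(x) \ln_q \frac{r(x)}{p(x)} \, dx $$

$$ = I_q(p\|r) . $$

This proves (5). From (5) it also follows that the minimum
in $\Gamma C$ corresponds to the minimum in $C$, which proves (4).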



Proposition 2 (Subset Independence): Let
$S_1, S_2, \ldots, S_n$ be a partition of $X$. Let the
new information $C$ comprise information about each of the
conditional densities $p(x \mid x \in S_i)$, $i = 1 \ldots n$.
Thus, $C = C_1 \wedge C_2 \wedge \cdots \wedge C_n$, where $C_i$
is the constraint set on the conditional densities of $S_i$.
Let $\mathcal{M}$ be the new information giving the probability
of being in each of the $n$ subsets, which is the constraint

$$ \sum_{x \in S_i} p(x) = m_i , \quad i = 1 \ldots n , $$

where $m_i$ are known values. Then given the prior
distribution $r$,

$$ p^{min}_{C \wedge \mathcal{M}}(x \mid x \in S_i) = \arg\min_{p \in C_i} I_q(p\|r_i) , \quad q \in (0,1) , \tag{6} $$

and

$$ I_q(p^{min}_{C \wedge \mathcal{M}}\|r) = \sum_{i=1}^{n} m_i I_q(p_i\|r_i) - \sum_{i=1}^{n} m_i \ln_q \frac{s_i}{m_i} + (1-q) \sum_{i=1}^{n} m_i \ln_q \frac{s_i}{m_i} \, I_q(p_i\|r_i) \tag{7} $$

hold, where

$$ p^{min}_{C \wedge \mathcal{M}} = \arg\min_{p \in C \wedge \mathcal{M}} I_q(p\|r) , $$

$$ p_i(x) = p^{min}_{C \wedge \mathcal{M}}(x \mid x \in S_i) , $$

$$ r_i(x) = r(x \mid x \in S_i) , $$

and $s_i$ are the prior probabilities of being in each subset,
given by $s_i = \sum_{x \in S_i} r(x)$.
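Proof:

$$ I_q(p^{min}_{C \wedge \mathcal{M}}\|r) = -\sum_{i=1}^{n} \sum_{x \in S_i} m_i p_i(x) \ln_q \frac{s_i r_i(x)}{m_i p_i(x)} . $$

Using the relation
$\ln_q(xy) = \ln_q x + \ln_q y + (1-q) \ln_q x \ln_q y$, we get

$$ I_q(p^{min}_{C \wedge \mathcal{M}}\|r) = -\sum_{i=1}^{n} \sum_{x \in S_i} m_i p_i(x) \left[ \ln_q \frac{s_i}{m_i} + \ln_q \frac{r_i(x)}{p_i(x)} + (1-q) \ln_q \frac{s_i}{m_i} \ln_q \frac{r_i(x)}{p_i(x)} \right] $$

$$ = \sum_{i=1}^{n} m_i I_q(p_i\|r_i) - \sum_{i=1}^{n} m_i \ln_q \frac{s_i}{m_i} + (1-q) \sum_{i=1}^{n} m_i \ln_q \frac{s_i}{m_i} \, I_q(p_i\|r_i) ; $$

this proves (7). To prove (6), it may be noted that each
of the terms $\ln_q \frac{s_i}{m_i}$ is a constant, and that the
coefficient of each $I_q(p_i\|r_i)$ on the rhs of (7),
$m_i [1 + (1-q) \ln_q \frac{s_i}{m_i}] = m_i (s_i/m_i)^{1-q}$,
is positive. Hence minimizing the rhs of (7) is independent of
the values taken by these constants, i.e., for $q \in (0,1)$
minimizing $I_q(p^{min}_{C \wedge \mathcal{M}}\|r)$ is equivalent
to minimizing each of the terms $I_q(p_i\|r_i)$. ■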
Let us further analyze equation (6) and try to interpret it.
It means that, given a system which naturally partitions
into subsets, we can find the posterior densities in two
different ways:

1) We can find the posterior $p^{min}_{C \wedge \mathcal{M}}$
and condition it on the different subsets $S_i$; or

2) We can condition the prior $r$ on the different subsets
$S_i$ and use that as a prior to minimize in the
constraint set $C_i$.

By (6) both these approaches give the same result.
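
The decomposition in equation (7) is an algebraic identity in the subset masses and the conditional distributions, so it can be checked on arbitrary numbers. The following is a minimal sketch; the partition, the two distributions, and the helper names are illustrative assumptions rather than data from the paper. It compares $I_q(p\|r)$ with the right-hand side of (7), where $m_i$ and $s_i$ are the masses that $p$ and $r$ assign to each subset.

```python
def ln_q(x, q):
    """q-logarithm (x**(1-q) - 1)/(1-q), used here for q != 1 only."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_divergence(p, r, q):
    """I_q(p||r) = -sum_x p(x) ln_q(r(x)/p(x))."""
    return -sum(px * ln_q(rx / px, q) for px, rx in zip(p, r) if px > 0.0)

q = 0.5
blocks = [[0, 1], [2, 3, 4]]           # hypothetical partition S_1, S_2
r = [0.1, 0.3, 0.2, 0.25, 0.15]        # hypothetical prior
p = [0.25, 0.15, 0.3, 0.1, 0.2]        # hypothetical posterior

lhs = tsallis_divergence(p, r, q)

rhs = 0.0
for block in blocks:
    m_i = sum(p[x] for x in block)     # posterior mass of the subset
    s_i = sum(r[x] for x in block)     # prior mass of the subset
    p_i = [p[x] / m_i for x in block]  # conditional posterior on the subset
    r_i = [r[x] / s_i for x in block]  # conditional prior on the subset
    d_i = tsallis_divergence(p_i, r_i, q)
    log_ratio = ln_q(s_i / m_i, q)
    rhs += m_i * d_i - m_i * log_ratio + (1.0 - q) * m_i * log_ratio * d_i

print(lhs, rhs)
```

Both numbers agree up to rounding, which illustrates how the total divergence splits into per-subset divergences plus a term that depends only on the subset masses.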

IV. Tsallis Divergence Minimization - 
Classical 

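The task of minimization can be defined as follows:
minimize $I_q(p\|r)$ subject to the constraints

$$ \sum_{x \in X} p(x) = 1 , \quad p(x) \geq 0 , \quad \sum_{x \in X} u_m(x) p(x) = \langle u_m \rangle , \quad m = 1, \ldots, M . \tag{8} $$

By choosing the Lagrangian for the minimization problem as

$$ \mathcal{L} = \sum_{x \in X} p(x) \frac{\left[ \frac{p(x)}{r(x)} \right]^{q-1} - 1}{q-1} - \frac{q\lambda - 1}{q-1} \left( \sum_{x \in X} p(x) - 1 \right) - \sum_{m=1}^{M} q \lambda \beta_m \left( \sum_{x \in X} u_m(x) p(x) - \langle u_m \rangle \right) , $$

where the multipliers of the two constraints are parametrized
through $\lambda$ and $\beta_m$ so that the stationarity
condition factorizes, the distribution that we get after
minimization is

$$ p(x) = r(x) \left[ \lambda \left( 1 + (q-1) \sum_{m=1}^{M} \beta_m u_m(x) \right) \right]^{\frac{1}{q-1}} . \tag{9} $$

Substituting (9) in (8) we get

$$ \lambda^{\frac{1}{1-q}} = \sum_{x \in X} r(x) \left( 1 - (1-q) \sum_{m=1}^{M} \beta_m u_m(x) \right)^{\frac{1}{q-1}} . $$

Substituting in (9) we get

$$ p(x) = \frac{ r(x) \left( 1 - (1-q) \sum_{m=1}^{M} \beta_m u_m(x) \right)^{\frac{1}{q-1}} }{ Z } , \tag{10} $$

where

$$ Z = \sum_{x \in X} r(x) \left( 1 - (1-q) \sum_{m=1}^{M} \beta_m u_m(x) \right)^{\frac{1}{q-1}} . $$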

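Equation (10) can be rewritten as

$$ p(x) = \frac{r(x)}{Z \exp_q \left( -\sum_{m=1}^{M} \beta_m u_m(x) \right)} , \tag{11} $$

where $Z$ is the normalizing constant of equation (10) and
$\exp_q$ is exponentiation in $q$-deformed algebra [13],
defined as

$$ \exp_q(x) = \begin{cases} [1 + (1-q)x]^{\frac{1}{1-q}} & \text{if } 1 + (1-q)x \geq 0 , \\ 0 & \text{otherwise} . \end{cases} $$

Using the relation

$$ \frac{\exp_q(x)}{\exp_q(y)} = \exp_q \left( \frac{x - y}{1 + (1-q) y} \right) $$

we get

$$ p(x) = \frac{r(x)}{Z} \exp_q \left( \frac{\sum_{m=1}^{M} \beta_m u_m(x)}{1 - (1-q) \sum_{m=1}^{M} \beta_m u_m(x)} \right) . \tag{12} $$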

Note that we need an extra condition known as Tsallis 
cut-off condition to prevent negative values for p(x). We 
have assumed this condition to be implicit. 
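
To make the minimization concrete, the sketch below computes a posterior of the form $p(x) = r(x) / (Z \exp_q(-\beta u(x)))$ for a single classical expectation constraint and then checks numerically that feasible perturbations do not lower the divergence. Everything specific in it is an assumption made for illustration: the three-state example, the target mean, the bracket for $\beta$, the use of bisection (which relies on the mean increasing with $\beta$ inside the cut-off region), and the helper names. The bracket is chosen so that $1 - (1-q)\beta u(x)$ stays positive, i.e., the cut-off condition is respected.

```python
import math

def ln_q(x, q):
    """q-logarithm (x**(1-q) - 1)/(1-q), used here for q != 1 only."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    """q-exponential; returns 0 outside the cut-off region (intended for q < 1)."""
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0.0 else 0.0

def tsallis_divergence(p, r, q):
    """I_q(p||r) = -sum_x p(x) ln_q(r(x)/p(x))."""
    return -sum(px * ln_q(rx / px, q) for px, rx in zip(p, r) if px > 0.0)

def posterior(r, u, beta, q):
    """p(x) = r(x) / (Z exp_q(-beta u(x))); beta must respect the cut-off."""
    weights = [rx / exp_q(-beta * ux, q) for rx, ux in zip(r, u)]
    z = sum(weights)
    return [w / z for w in weights]

def mean(p, u):
    return sum(px * ux for px, ux in zip(p, u))

def solve_beta(r, u, target, q, lo, hi, iterations=200):
    """Bisection on beta, assuming the posterior mean increases with beta."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean(posterior(r, u, mid, q), u) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical problem: uniform prior on {0, 1, 2}, constraint <u> = 1.3.
q = 0.5
u = [0.0, 1.0, 2.0]
r = [1.0 / 3.0] * 3
target = 1.3
beta = solve_beta(r, u, target, q, lo=-50.0, hi=0.999)
p = posterior(r, u, beta, q)

# The direction (1, -2, 1) preserves both the normalization and the mean.
eps = 0.01
direction = [1.0, -2.0, 1.0]
p_plus = [px + eps * d for px, d in zip(p, direction)]
p_minus = [px - eps * d for px, d in zip(p, direction)]
d_min = tsallis_divergence(p, r, q)
d_plus = tsallis_divergence(p_plus, r, q)
d_minus = tsallis_divergence(p_minus, r, q)
print(beta, p, d_min, d_plus, d_minus)
```

In this example both perturbed distributions satisfy the same constraints and have a larger divergence from the prior, as expected from convexity for $q > 0$.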

V. Shore and Johnson Properties Involving Maximum Entropy

In this section we discuss properties which depend on
the formalism employed.

Proposition 3 (Uniqueness): For $q > 0$, given a prior,
the posterior probability distribution is unique.

Proof: For $q > 0$ Tsallis divergence is a convex
function of both its parameters, and strictly convex in $p$.
Since the constraint set $C$ is a convex set, the minimizer
is unique. ■

Proposition 4 (Reflexiveness): For $q > 0$, given a
prior $r$ and constraint set $C$, the posterior obtained by
minimizing the Tsallis divergence is the same as $r$ if and
only if $r \in C$.

Proof: This property follows directly from the
facts that $I_q(p\|r) = 0$ iff $p = r$, and
$I_q(p\|r) \geq 0$ for $q > 0$. ■

Proposition 5 (Idempotence): Given a prior $r$ and
constraint set $C$, let $p$ be the posterior obtained; then
$\arg\min_{u \in C} I_q(u\|p) = p$, i.e., taking the same
information into account twice has the same effect as taking
it into account once.

Proof: This is a simple corollary of Proposition 4:
since $p \in C$, the posterior obtained by taking $p$ as prior
and $C$ as constraint will also be $p$. ■

Proposition 6 (Invariance): Given a prior $r$, consider
the constraint sets $C_1$ and $C_2$, and let
$p = \arg\min_{u \in C_1} I_q(u\|r)$. If $p \in C_2$, then the
following relations hold:

$$ p = \arg\min_{u \in C_1 \wedge C_2} I_q(u\|r) \tag{13} $$

$$ = \arg\min_{u \in C_1 \wedge C_2} I_q(u\|p) \tag{14} $$

$$ = \arg\min_{u \in C_2} I_q(u\|p) . \tag{15} $$

Proof: $p \in C_1$ and $p \in C_2$, hence
$p \in C_1 \wedge C_2$, so from Proposition 4 both (14) and (15)
follow. We know that $p = \arg\min_{u \in C_1} I_q(u\|r)$ and
$p \in C_1 \wedge C_2 \subseteq C_1$; from these two facts,
(13) follows. ■
The result shows that if the posterior obtained from $C_1$
is an element of $C_2$, then applying $C_2$ on the posterior in
different ways does not result in any change.

Proposition 7 (Weak Subset Independence):
Let $S_1, S_2, \ldots, S_n$ be a partition of $X$. Let the
new information $C$ comprise information about each of the
conditional densities $p(x \mid x \in S_i)$, $i = 1 \ldots n$.
Thus, $C = C_1 \wedge C_2 \wedge \cdots \wedge C_n$, where $C_i$
is the constraint set on the conditional densities of $S_i$.
Then given the prior distribution $r$,

$$ p^{min}_{C}(x \mid x \in S_i) = \arg\min_{p \in C_i} I_q(p\|r_i) , \quad q \in (0,1) , \tag{16} $$

and

$$ I_q(p^{min}_{C}\|r) = \sum_{i=1}^{n} u_i I_q(p_i\|r_i) - \sum_{i=1}^{n} u_i \ln_q \frac{s_i}{u_i} + (1-q) \sum_{i=1}^{n} u_i \ln_q \frac{s_i}{u_i} \, I_q(p_i\|r_i) \tag{17} $$

hold, where

$$ p^{min}_{C} = \arg\min_{p \in C} I_q(p\|r) , $$

$$ p_i(x) = p^{min}_{C}(x \mid x \in S_i) , $$

$$ r_i(x) = r(x \mid x \in S_i) . $$

Here $s_i$ are the prior probabilities of being in each
subset, given by $s_i = \sum_{x \in S_i} r(x)$, and $u_i$ are
the posterior probabilities of being in each subset, given by
$u_i = \sum_{x \in S_i} p^{min}_{C}(x)$.
Proof: Let $\mathcal{R}$ be the information defined by the
constraint $\sum_{x \in S_i} p(x) = u_i$; then it follows from
Proposition 6 that

$$ \arg\min_{p \in C} I_q(p\|r) = \arg\min_{p \in C \wedge \mathcal{R}} I_q(p\|r) . $$

Now we can apply Proposition 2 to get (16) and (17). ■

This result is the same as Proposition 2 and has the same
interpretation. The difference here lies in the fact that we
do not have prior information $\mathcal{M}$ regarding the total
probability in each subset.

Proposition 8 (Subset Aggregation): Let
$S_1, S_2, \ldots, S_n$ be a partition of $X$. Let $\Gamma$ be a
transformation which converts a given distribution $p$ to a
discrete distribution over the $S_i$; the transformation is
defined by

$$ p'(x_i) = \Gamma p = \int_{S_i} p(x) \, dx , $$

where $x_i$ is a discrete state corresponding to $x \in S_i$.
Let $C'$ be the new information about the distribution
$\Gamma p$. Then for a given prior $r$,

$$ r(x \mid x \in S_i) = p_{min}(x \mid x \in S_i) , \tag{18} $$

$$ \Gamma p_{min} = (\Gamma p)_{min} , \tag{19} $$

$$ \text{and} \quad I_q(\Gamma p_{min}\|\Gamma r) = I_q(p_{min}\|r) , \tag{20} $$

where $p_{min} = \arg\min_{p \in \Gamma^{-1}(C')} I_q(p\|r)$ and
$(\Gamma p)_{min} = \arg\min_{p' \in C'} I_q(p'\|\Gamma r)$.

Proof: The constraint set $C'$ is defined by a set of
expectations

$$ \sum_{i=1}^{n} p'(x_i) u_m(x_i) = \langle u_m \rangle , \quad m = 1 \ldots M . $$

In terms of $p$, with $p' = \Gamma p$, the constraint set can
be represented as

$$ \sum_{x} p(x) w_m(x) = \langle u_m \rangle , \quad m = 1 \ldots M , $$

where $w_m$ is defined as

$$ w_m(x) = u_m(x_i) , \quad \text{for } x \in S_i , \; i = 1 \ldots n , $$

i.e., $w_m$ is constant in each of the subsets $S_i$.
From (11) we get

$$ p_{min}(x) = \frac{r(x)}{Z \exp_q \left( -\sum_{m=1}^{M} \beta_m w_m(x) \right)} . \tag{21} $$

Since $w_m$ is a constant within each subset $S_i$ and $Z$ is
a constant in itself, equation (21) reduces to

$$ p_{min}(x) = K_i \, r(x) , $$

where $K_i$ is a constant for each subset. Now we have

$$ r(x \mid x \in S_i) = \frac{r(x)}{\int_{S_i} r(y) \, dy} = \frac{K_i \, r(x)}{\int_{S_i} K_i \, r(y) \, dy} = p_{min}(x \mid x \in S_i) . $$

This proves (18).

Now consider the relation

$$ I_q(p_{min}\|r) = \sum_{i=1}^{n} u_i I_q(p_i\|r_i) - \sum_{i=1}^{n} u_i \ln_q \frac{s_i}{u_i} + (1-q) \sum_{i=1}^{n} u_i \ln_q \frac{s_i}{u_i} \, I_q(p_i\|r_i) , \tag{22} $$

which follows from (17), where

$$ p_i(x) = p_{min}(x \mid x \in S_i) , $$

$$ r_i(x) = r(x \mid x \in S_i) , $$

$$ s_i = \sum_{x \in S_i} r(x) , $$

$$ \text{and} \quad u_i = \sum_{x \in S_i} p_{min}(x) . $$

From (18) we have that $p_i(x) = r_i(x)$ and hence
$I_q(p_i\|r_i) = 0$. Now equation (22) reduces to

$$ I_q(p_{min}\|r) = -\sum_{i=1}^{n} u_i \ln_q \frac{s_i}{u_i} = I_q(\Gamma p_{min}\|\Gamma r) . $$

This proves (19) and (20).
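
Subset aggregation can be illustrated on a small discrete example. The sketch below is hedged in the same way as any toy check: the five-state prior, the two-block partition, the target value, the bisection bracket, and all helper names are assumptions introduced here. The constraint function is constant on each block, the posterior is computed from the form in equation (21), and the code then compares the conditional distributions within each block and the divergences before and after aggregation.

```python
def ln_q(x, q):
    """q-logarithm (x**(1-q) - 1)/(1-q), used here for q != 1 only."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    """q-exponential; returns 0 outside the cut-off region (intended for q < 1)."""
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0.0 else 0.0

def tsallis_divergence(p, r, q):
    """I_q(p||r) = -sum_x p(x) ln_q(r(x)/p(x))."""
    return -sum(px * ln_q(rx / px, q) for px, rx in zip(p, r) if px > 0.0)

def posterior(r, w, beta, q):
    """p(x) = r(x) / (Z exp_q(-beta w(x))), as in equation (21) with M = 1."""
    weights = [rx / exp_q(-beta * wx, q) for rx, wx in zip(r, w)]
    z = sum(weights)
    return [v / z for v in weights]

def solve_beta(r, w, target, q, lo, hi, iterations=200):
    """Bisection on beta, assuming the posterior mean increases with beta."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        p_mid = posterior(r, w, mid, q)
        if sum(px * wx for px, wx in zip(p_mid, w)) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

q = 0.5
r = [0.1, 0.3, 0.2, 0.25, 0.15]   # hypothetical prior on five states
blocks = [[0, 1], [2, 3, 4]]      # hypothetical partition S_1, S_2
u_aggregated = [0.0, 1.0]         # u(x_1), u(x_2) on the aggregated states
w = [0.0] * len(r)
for i, block in enumerate(blocks):
    for x in block:
        w[x] = u_aggregated[i]    # w is constant within each subset
target = 0.8                      # constraint <u> = 0.8, i.e. mass of S_2

beta = solve_beta(r, w, target, q, lo=-50.0, hi=1.999)
p = posterior(r, w, beta, q)

prior_mass = [sum(r[x] for x in block) for block in blocks]
post_mass = [sum(p[x] for x in block) for block in blocks]
conditional_gap = max(
    abs(p[x] / post_mass[i] - r[x] / prior_mass[i])
    for i, block in enumerate(blocks) for x in block
)
divergence_gap = abs(
    tsallis_divergence(p, r, q) - tsallis_divergence(post_mass, prior_mass, q)
)
print(post_mass, conditional_gap, divergence_gap)
```

In this example the conditionals inside each block are unchanged by the update and the two divergences coincide up to rounding, matching (18) and (20).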



VI. Some Observations On Duality and 
Pythagoras 

A. Pythagorean Property 

Because of its extensive use in many problems, the
Pythagorean property is very important. It has been shown
to exist for both the second and third formalisms, involving
$q$-expectation and normalized $q$-expectation respectively.
In this section we attempt to find the equivalent
result for the classical expectation. The result we obtain is
not promising, but we present it here for future reference
and to introduce an alternative way to manipulate the
Lagrange multipliers. Let us formally state the problem at
hand.

a) Problem statement: Let $r$ be the prior distribution
and let $p$ be the posterior obtained by minimizing the
Tsallis divergence subject to the constraint set $C$

$$ \sum_{x \in X} p(x) u_m(x) = \langle u_m \rangle , \quad m = 1 \ldots M . $$

Let $l$ be another distribution satisfying the constraint

$$ \sum_{x \in X} l(x) u_m(x) = \langle w_m \rangle , \quad m = 1 \ldots M . $$

We are interested in finding the relation between
$\langle u_m \rangle$ and $\langle w_m \rangle$ so as to minimize
the divergence $I_q(l\|p)$.


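b) Solution: To find a solution to this problem
we shall minimize the Tsallis divergence in a different
manner. We start the minimization with the following
Lagrangian

$$ \mathcal{L} = \sum_{x \in X} p(x) \frac{\left[ \frac{p(x)}{r(x)} \right]^{q-1} - 1}{q-1} - (1 - q\lambda) \left( \sum_{x \in X} p(x) - 1 \right) - q \sum_{m=1}^{M} \beta_m \left( \sum_{x \in X} u_m(x) p(x) - \langle u_m \rangle \right) . $$

Differentiating $\mathcal{L}$ with respect to $p(x)$ and
equating to 0, we get
$\ln_q \frac{r(x)}{p(x)} = \lambda - \sum_{m=1}^{M} \beta_m u_m(x)$,
that is,

$$ p_{min} = p(x) = \frac{r(x)}{\exp_q \left( \lambda - \sum_{m=1}^{M} \beta_m u_m(x) \right)} . \tag{23} $$

Writing equation (23) in its $q$-logarithmic form, multiplying
by $p(x)$ and summing over $X$ we get

$$ -I_q^{min}(p\|r) = \lambda - \sum_{m=1}^{M} \beta_m \langle u_m \rangle . \tag{24} $$

Differentiating $I_q^{min}(p\|r)$ with respect to
$\langle u_m \rangle$ we get

$$ \frac{\partial I_q^{min}}{\partial \langle u_m \rangle} = \beta_m . \tag{25} $$

Substituting

$$ \beta_m = \beta'_m (1 + (1-q)\lambda) , $$

equation (23) reduces to

$$ p(x) = \frac{r(x)}{Z \exp_q \left( -\sum_{m=1}^{M} \beta'_m u_m(x) \right)} , \tag{26} $$

where $Z = \exp_q(\lambda)$. Hence equation (23) can be rewritten
as

$$ \ln_q \frac{r(x)}{p(x)} = \ln_q Z - \sum_{m=1}^{M} \beta_m u_m(x) . $$

Multiplying this equation by $p(x)$ and summing over $X$
we get

$$ -I_q^{min} = \ln_q Z - \sum_{m=1}^{M} \beta_m \langle u_m \rangle . $$

Differentiating $I_q^{min}$ with respect to $\beta_m$ and
equating to 0 we get

$$ \frac{\partial \ln_q Z}{\partial \beta_m} = \langle u_m \rangle . \tag{27} $$

Equations (25) and (27) are the Legendre transform relations.
Given these relations and the divergence minimization,
let us look at the Pythagorean property.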
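We want to minimize the divergence $I_q(l\|p)$. For this we
proceed as follows:

$$ I_q(l\|r) - I_q(l\|p) = -\sum_{x \in X} l(x) \left[ \ln_q \frac{r(x)}{l(x)} - \ln_q \frac{p(x)}{l(x)} \right] . $$

Using the relation
$\ln_q \frac{x}{y} = y^{q-1} (\ln_q x - \ln_q y)$, we get

$$ I_q(l\|r) - I_q(l\|p) = -\sum_{x \in X} l(x) \ln_q \frac{r(x)}{p(x)} \left[ 1 + (1-q) \ln_q \frac{p(x)}{l(x)} \right] . $$

Using equation (23),

$$ I_q(l\|r) - I_q(l\|p) = -\sum_{x \in X} l(x) \left( \lambda - \sum_{m=1}^{M} \beta_m u_m(x) \right) \left[ 1 + (1-q) \ln_q \frac{p(x)}{l(x)} \right] $$

$$ = -\left( \lambda - \sum_{m=1}^{M} \beta_m \langle w_m \rangle - (1-q) \lambda I_q(l\|p) - (1-q) \sum_{x \in X} l(x) \ln_q \frac{p(x)}{l(x)} \sum_{m=1}^{M} \beta_m u_m(x) \right) . \tag{28} $$

The minimum of $I_q(l\|p)$ is achieved for
$\partial I_q(l\|p) / \partial \beta_m = 0$.
Differentiating (28) we get

$$ \frac{\partial \lambda}{\partial \beta_m} - \langle w_m \rangle - (1-q) I_q(l\|p) \frac{\partial \lambda}{\partial \beta_m} - (1-q) \sum_{x \in X} l(x) \frac{\partial}{\partial \beta_m} \left[ \ln_q \frac{p(x)}{l(x)} \sum_{k=1}^{M} \beta_k u_k(x) \right] = 0 . $$

Using equation (27), with $\lambda = \ln_q Z$, we get

$$ \langle u_m \rangle - \langle w_m \rangle - (1-q) \langle u_m \rangle I_q(l\|p) - (1-q) \sum_{x \in X} l(x) \frac{\partial}{\partial \beta_m} \left[ \ln_q \frac{p(x)}{l(x)} \sum_{k=1}^{M} \beta_k u_k(x) \right] = 0 . \tag{29} $$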



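Evaluating it further by using the relations

$$ \ln_q \frac{x}{y} = \frac{\ln_q x - \ln_q y}{1 + (1-q) \ln_q y} \quad \text{and} \quad p(x) = \frac{r(x)}{\exp_q \left( \lambda - \sum_{m=1}^{M} \beta_m u_m(x) \right)} , $$

we get the following expression for $\langle w_m \rangle$: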



( Um )(i - i q (i\\p)) + - q)Pmi q (i\\p) 

l q (x) (u m (x) - (u m )) * 



£ r 

xex ^i + (i- q )\ nq L 



(30) 



where $* = (1 + (1-q) \ln_q r(x)) \sum_{m=1}^{M} \beta_m u_m(x)$.
Note that in this expression $r(x)$ can be replaced in terms
of $p(x)$.

Though this relation does not seem promising, we have
mentioned it here for the sake of completeness.

B. Additive transformation - $q \leftrightarrow 2-q$

In $q$-deformed algebra there exists a
$q \leftrightarrow 2-q$ duality, which is the following:

$$ \ln_q (1/x) = -\ln_{2-q}(x) , \tag{31} $$

$$ \exp_q(-x) = \frac{1}{\exp_{2-q}(x)} . \tag{32} $$

Using this duality Tsallis entropy has been well studied,
i.e., various properties of $S_{2-q}$ have been studied. Initial
observations regarding $S_{2-q}$ were made by Baldovin and
Robledo [21]. Naudts [22] has further analyzed both the
dualities. Further study has been carried out by Wada
and Scarfone [23], who found relations between
the Lagrange multipliers of both the dualities. In this
section we introduce a similar transformation for Tsallis
divergence.
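
The two duality relations, $\ln_q(1/x) = -\ln_{2-q}(x)$ and $\exp_q(-x) = 1/\exp_{2-q}(x)$, can be verified numerically. The sketch below is only a spot check under assumptions chosen here: the value $q = 0.7$, the sample points, and the helper names are illustrative, and the sample points are kept inside the region where both deformed exponentials are positive.

```python
import math

def ln_q(x, q):
    """q-logarithm (x**(1-q) - 1)/(1-q); the ordinary logarithm at q = 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    """q-exponential [1 + (1-q)x]**(1/(1-q)) where the bracket is positive."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0.0 else 0.0

q = 0.7  # hypothetical value; the dual index is 2 - q = 1.3

# Gap in the logarithm relation ln_q(1/x) = -ln_{2-q}(x).
log_gaps = [abs(ln_q(1.0 / x, q) + ln_q(x, 2.0 - q)) for x in (0.2, 1.0, 3.5)]

# Gap in the exponential relation exp_q(-x) = 1 / exp_{2-q}(x).
exp_gaps = [abs(exp_q(-x, q) - 1.0 / exp_q(x, 2.0 - q)) for x in (-0.5, 0.0, 0.8)]

print(max(log_gaps), max(exp_gaps))
```

Both gaps are at the level of floating-point rounding, so the posterior $r(x) / (Z \exp_q(-\sum_m \beta_m u_m(x)))$ can equally be written with $\exp_{2-q}$ in the numerator.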

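Given a prior $r$ and the constraint set $C$ defined by

$$ \sum_{x \in X} p(x) = 1 , \quad p(x) \geq 0 , \quad \sum_{x \in X} u_m(x) p(x) = \langle u_m \rangle , \quad m = 1, \ldots, M , $$

from equation (11) we have

$$ p(x) = \frac{r(x)}{Z \exp_q \left( -\sum_{m=1}^{M} \beta_m u_m(x) \right)} , $$

and using the relation (32) it becomes

$$ p(x) = \frac{r(x) \exp_{2-q} \left( \sum_{m=1}^{M} \beta_m u_m(x) \right)}{Z} . $$

This form of the posterior mirrors the exponential-family
form, with the exponential replaced by a deformed
exponential, and is the basis for the
$q \leftrightarrow 2-q$ transformation. Note that

$$ 2 - (2 - q) = q , $$

i.e., if we minimize $I_{2-q}(p\|r)$ instead of $I_q(p\|r)$,
we have

$$ p(x) = \arg\min_{p \in C} I_{2-q}(p\|r) = \frac{r(x) \exp_q \left( \sum_{m=1}^{M} \beta_m u_m(x) \right)}{\sum_{x \in X} r(x) \exp_q \left( \sum_{m=1}^{M} \beta_m u_m(x) \right)} . $$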



VII. Conclusion 

In this work we explored the Shore and Johnson properties
for the Tsallis formalism of the third kind, involving
normalized $q$-expectation; it was observed that none of these
properties hold for that formalism. In the study of the
first formalism, involving classical expectation, on the
other hand, we have been able to establish a substantial
number of the Shore and Johnson properties. We were also
able to establish a crude form of Pythagorean relation. We
have also found a $q \leftrightarrow 2-q$ additive
transformation, which gives a very good form for the
posterior distribution. We conclude from these observations
that the first formalism is of stronger theoretical and
practical significance, and these results, along with the
$q \leftrightarrow 2-q$ additive transformation, also provide
some groundwork for the definition of a power-law family.